Load tidyverse and tidymodels libraries and set the theme (optional).

library(tidyverse)         # for reading in data, graphing, and cleaning
library(lubridate)         # for date manipulation
library(tidymodels)        # for modeling
library(moderndive)        # for King County housing data
theme_set(theme_minimal()) # my favorite ggplot2 theme :)

Read in the King County Housing data and take a look at the first 5 rows.

data("house_prices")

house_prices %>% 
  slice(1:5)

According to the house_prices documentation, “This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. This dataset was obtained from Kaggle.com.” The description of the variables in the dataset in the documentation seem to be a little off. A more accurate description is provided in the image below.

Exploration

Take a quick look at distributions of all the variables to check for anything irregular.

Quantitative variables:

house_prices %>% 
  select_if(is.numeric) %>% 
  pivot_longer(cols = everything(),names_to = "variable", values_to = "value") %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(variable), scales = "free")

Things I noticed and pre-processing thoughts: * Right-skewness in price and all variables regarding square footage –> log transform if using linear regression. * Many 0’s in sqft_basement, view, and yr_renovated –> create indicator variables of having that feature vs. not, ie. a variable called basement where a 0 indicates no basement (sqft_basement = 0) and a indicates a basement (sqft_basement> 0). * Age of home may be a better, more interpretable variable than year built -->age_at_sale = year(date) - yr_built`.

house_prices %>% 
  select_if(is.factor) %>% 
  pivot_longer(cols = everything(),names_to = "variable", values_to = "value") %>% 
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(variable), scales = "free", nrow = 2)

Things I noticed and pre-processing thoughts: * condition and grade both have levels with low counts –> make fewer categories.
* zipcode has many unique levels –> don’t use that variable.

The only other variables are id (not used in modeling), date, and waterfront. We might consider using the month the house was sold as a variable.

Data splitting

First, we split the data into training and testing datasets. We use the training data to fit different types of models and to tune parameters of those models, if needed. The testing dataset is saved for the very end to compare a small subset of models. The initial_split() function from the rsample library (part of tidymodels) is used to create this split. We just do random splitting with this dataset, but there are other arguments that allow you to do stratified sampling. Then we use training() and testing() to extract the two datasets, house_training and house_testing.

set.seed(327) #for reproducibility

# Randomly assigns 75% of the data to training.
house_split <- initial_split(house_prices, 
                             prop = .75)
house_split
## <16210/5403/21613>
#<training/testing/total>

house_training <- training(house_split)
house_testing <- testing(house_split)

Later, we will use 5-fold cross-validation to evaluate the model and tune model parameters. We set up the five folds of the training data using the vfold_cv() function. We will explain this in more detail later.

set.seed(1211) # for reproducibility
house_cv <- vfold_cv(house_training, v = 5)

Model preprocessing: recipe()s and step_xxx()s

house_recipe <- recipe(price ~ ., #short-cut, . = all other vars
                       data = house_training) %>% 
  # Pre-processing:
  # Remove, redundant to sqft_living and sqft_lot
  step_rm(sqft_living15, sqft_lot15) %>%
  # log sqft variables & price
  step_log(starts_with("sqft"),-sqft_basement, price, 
           base = 10) %>% 
  # new grade variable combines low grades & high grades
  # indicator variables for basement, renovate, and view 
  # waterfront to numeric
  # age of house
  step_mutate(grade = as.character(grade),
              grade = fct_relevel(
                        case_when(
                          grade %in% "1":"6"   ~ "below_average",
                          grade %in% "10":"13" ~ "high",
                          TRUE ~ grade
                        ),
                        "below_average","7","8","9","high"),
              basement = as.numeric(sqft_basement == 0),
              renovated = as.numeric(yr_renovated == 0),
              view = as.numeric(view == 0),
              waterfront = as.numeric(waterfront),
              age_at_sale = year(date) - yr_built)%>% 
  # Remove sqft_basement, yr_renovated, and yr_built
  step_rm(sqft_basement, yr_renovated, yr_built) %>% 
  # Create a month variable
  step_date(date, features = "month") %>% 
  # Make these evaluative variables, not included in modeling
  update_role(all_of(c("id","date","zipcode", 
                       "lat", "long")),
              new_role = "evaluative") %>% 
  # Create indicator variables for factors/character/nominal
  step_dummy(all_nominal(), all_predictors(), 
             -has_role(match = "evaluative"))

Apply to training dataset, just to see what happens. Notice the names of the variables.

house_recipe %>% 
  prep(house_training) %>%
  juice() 

Defining the model and creating workflows

Now that we have split and pre-processed the data, we are ready to model! First, we will model price (which is actually now log(price)) using simple linear regression.

We will do this using some modeling functions from the parsnip package. Find all available functions here. Here is the detail for linear regression.

In order to define our model, we need to do these steps:

house_linear_mod <- 
  # Define a linear regression model
  linear_reg() %>% 
  # Set the engine to "lm" (lm() function is used to fit model)
  set_engine("lm") %>% 
  # Not necessary here, but good to remember for other models
  set_mode("regression")

This is just setting up the process. We haven’t fit the model to data yet, and there’s still one more step before we do - creating a workflow! This combines the preprocessing and model definition steps.

house_lm_wf <- 
  # Set up the workflow
  workflow() %>% 
  # Add the recipe
  add_recipe(house_recipe) %>% 
  # Add the modeling
  add_model(house_linear_mod)

house_lm_wf
## ══ Workflow ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## ● step_rm()
## ● step_log()
## ● step_mutate()
## ● step_rm()
## ● step_date()
## ● step_dummy()
## 
## ── Model ─────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Modeling and evaluating

Now we are finally ready to fit the model! After all that work, this part seems easy. We first use the fit() function to fit the model, telling it which data set we want to fit the model to. Then we use some other functions to display the results nicely.

house_lm_fit <- 
  # Tell it the workflow
  house_lm_wf %>% 
  # Fit the model to the training data
  fit(house_training)

# Display the results nicely
house_lm_fit %>% 
  pull_workflow_fit() %>% 
  tidy() %>% 
  mutate_if(is.numeric, ~round(.x,3))

To evaluate the model, we will use cross-validation (CV), specifically 5-fold CV. (I am guessing we don’t have to do both the previous step of fitting a model on the training data AND this step, but I couldn’t figure out how to extract the final model from the CV data … so this was my solution for now.) So, we fit the model using the 5-fold dataset we created at the beginning. For a deeper discussion of cross-validation, I recommend Bradley Boehmke’s Resampling section of Hands on Machine Learning with R.

set.seed(456) # For reproducibility

house_lm_fit_cv <-
  # Tell it the workflow
  house_lm_wf %>% 
  # Fit the model (using the workflow) to the cv data
  fit_resamples(house_cv)

# The evaluation metrics for each fold:
house_lm_fit_cv %>% 
  select(id, .metrics) %>% 
  unnest()
# Evaluation metrics averaged over all folds:
collect_metrics(house_lm_fit_cv)
# Just to show you where the averages come from:
house_lm_fit_cv %>% 
  select(id, .metrics) %>% 
  unnest() %>% 
  group_by(.metric, .estimator) %>% 
  summarize(mean = mean(.estimate),
            n = n(),
            std_err = sd(.estimate)/sqrt(n))

Predicting and evaluating testing data

In this simple scenario, we may be interested in seeing how the model performs on the testing data that was left out. The code below will fit the model to the training data and apply it to the testing data. There are other ways we could have done this, but the way we do it here will be useful when we start using more complex models where we need to tune model parameters.

After the model is fit and applied, we collect the performance metrics and display them and show the predictions from the testing data.

house_lm_test <- 
  # The modeling work flow
  house_lm_wf %>% 
  # Use training data to fit the model and apply it to testing data
  last_fit(house_split)

# performance metrics from testing data
collect_metrics(house_lm_test)
# predictions from testing data
collect_predictions(house_lm_test)

The code below creates a simple plot to examine predicted vs. actual price from the house data.

collect_predictions(house_lm_test) %>% 
  ggplot(aes(x = price, y = .pred)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1, intercept = 0, color = "darkred") +
  labs(x = "Actual log(price)", y = "Predicted log(price)")

collect_predictions(house_lm_test) %>% 
  ggplot(aes(x = 10^price, y = 10^.pred)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1, intercept = 0, color = "darkred") +
  labs(x = "Actual price", y = "Predicted price")

How will the model be used?

When we use create models, it is important to think about how the model will be used and specifially how the model could do harm.

More complex model with tuning parameters